Introduction to Open Data Science - Course Project
About the project
Greetings, this is my page for the course
“Introduction to Open Data Science, 5 ECTS”. I found this
course through a strong recommendation from a friend. I have 2 years of
experience with R, but now want to take it to the next
level with Git. I’m thrilled to learn proper version
control and the use of GitHub, which promotes sharing and publishing code and
perhaps networking with other R coders. Alongside these hopes, I
anticipate loads of fun from this course!
Lastly, check out my GitHub repository.
Let’s enjoy this interesting course!
1 Start me up!
R for Health Data Science book
This book is a blast to read on the couch from an iPad.
The Exercise Set 1
I completed chapters 1–5 and was familiar with the substance, but there were some nifty functions that I’ll probably try out soon.
2 Regression and model validation
This part describes the statistical analysis of the
learning2014 dataset.
Instructions
Describe the work you have done this week and summarize your learning.
- Describe your work and results clearly.
- Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
- Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
Setup
First let’s load our packages and some custom functions.
# Run time
start.time <- Sys.time()
# Packages
pacman::p_load(tidyverse, GGally, patchwork, ggfortify)
## Defining some custom functions
# Colors
colorpair <- c("orangered", "dodgerblue")
color_gender <- \ () {scale_color_manual(values = colorpair)}
fill_gender <- \ () {scale_fill_manual(values = colorpair)}
# Beautiful labels
labs <- tribble(
~var, ~lab,
"age", "Age",
"attitude", "Attitude",
"points", "Exam points",
"deep", "Deep learning",
"stra", "Strategic learning",
"surf", "Surface learning"
)
2.1 Get and explore data
Instructions
Read the students2014 data into R. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
Let’s read in the learning2014 dataset and check that it
is read correctly.
# Read data
datasets <- list.files("data/learning/", full.names = T)
newest <- datasets %>% sub('.*_', '', .) %>% as.Date() %>% max()
data <- datasets[grepl(newest, datasets)] %>%
readr::read_csv(show_col_types = F)
# Glance at the data
# data %>% head() %>% knitr::kable()
str(data, give.attr = F)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
# Plot data missingness
naniar::vis_miss(data) + theme_darkmode() + ggtitle("Missingness map")
There are a total of 166 observations and 7 variables. We are dealing with complete data with no missing values. The variables can be divided into the types below (click here for more comprehensive metadata).
Population characteristics
- age: Age of the participant (in years), derived from the date of birth.
- gender: Gender coded as a nominal variable with two defined/prevalent values (F = Female, M = Male).
Survey answers
Clearly, the four variables attitude, deep,
stra, and surf represent survey answers on a
Likert scale (1–5). attitude captures the student’s global
attitude towards statistics. The rest of the variables were computed as
averages of various interrelated questions and describe the traits below,
which are also explained in detail e.g., here.
These traits are important, as they may influence success in students’
work, reflected by exam points.
- surf: Surface learning; emphasis upon memorising details to complete the assignment. Learning may be more superficial.
- deep: Deep learning; looking for the overall meaning and attempting to process information in a holistic way.
- stra: Strategic learning; organizing one’s learning with the objective of achieving a positive outcome. Can involve a combination of both deep and surface learning strategies.
Exam results
points: Results of the statistics exam. I did not find the maximum possible value from the metadata. Before the analysis, students that did not attend the exam (points == 0) were excluded from the dataset.
To note, these data lack a subject identifier (ID) variable. However, it may not be needed, as we do not have repeated measures.
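With no ID column, one cheap sanity check is to look for accidental duplicate rows. A minimal sketch on a toy data frame (on the real data, the call would be `sum(duplicated(data))`):

```r
# duplicated() flags rows identical to an earlier row across all columns
toy <- data.frame(gender = c("F", "M", "F"), points = c(25, 12, 25))
sum(duplicated(toy))  # 1: the third row repeats the first
```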
2.2 Graphical overview
Instructions
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
Let’s visualize the distributions of our dataset with
ggplot2.
p <- data %>% pivot_longer(
cols = colnames(data)[-1],
values_to = "value",
names_to = "var"
) %>%
left_join(labs) %>%
ggplot(aes(x = value, color = lab, fill = lab, y = lab, linetype = gender))
p <- p + ggridges::geom_density_ridges(
aes(point_shape = gender),
rel_min_height = .001, quantile_lines = T, quantiles = .5,
jittered_points = TRUE, position = "raincloud",
point_alpha = .5, point_size = 2, alpha = .3
)
p <- p + scale_shape_manual(values = c(0, 2))
breakx <- c(1, 5, seq(10, 60, by = 10))
p <- p + scale_x_sqrt(breaks = breakx, labels = breakx)
p <- p + labs(
title = "Distributions by gender",
subtitle = "(On a square root x-axis)"
) + xlab("Value") + ylab("")
p
Likert-scale variables are quite evenly spread on
the range of 1–5. It appears that these students are deep
learners more so than surface learners (surf). By visual
inspection, men have a more positive attitude towards
statistics than women. Students tend to be quite
young, although a few older students
skew the distribution heavily to the right. It also seems that male
students may be slightly older than female students. Whether these differences are
statistically significant would require further testing…
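One way to follow up would be a Wilcoxon rank-sum test on the suspected age difference. A minimal sketch on simulated ages (on the real data, `wilcox.test(age ~ gender, data = data)`); the group means and SDs below are made-up stand-ins:

```r
set.seed(1)
# Simulated ages for two groups, loosely mimicking the observed means/SDs
sim <- data.frame(
  gender = rep(c("F", "M"), each = 50),
  age = c(rnorm(50, mean = 25, sd = 7), rnorm(50, mean = 27, sd = 8))
)
res <- wilcox.test(age ~ gender, data = sim)  # two-sided rank-sum test
res$p.value  # a small p would support a genuine age difference
```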
Numerical summaries
Let’s summarize the numerical variables.
# Summaries for numeric variables
do.call(cbind, lapply(data, summary)) %>%
data.frame() %>% select(-gender) %>%
mutate(across(1:6, \ (x) as.numeric(x))) %>%
mutate(across(1:6, \ (x) round(x, 1))) %>%
t() %>% knitr::kable() # DT::datatable()
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| age | 17.0 | 21.0 | 22.0 | 25.5 | 27.0 | 55.0 |
| attitude | 1.4 | 2.6 | 3.2 | 3.1 | 3.7 | 5.0 |
| deep | 1.6 | 3.3 | 3.7 | 3.7 | 4.1 | 4.9 |
| stra | 1.2 | 2.6 | 3.2 | 3.1 | 3.6 | 5.0 |
| surf | 1.6 | 2.4 | 2.8 | 2.8 | 3.2 | 4.3 |
| points | 7.0 | 19.0 | 23.0 | 22.7 | 27.8 | 33.0 |
Let’s also see the distribution of gender. It appears
that most students (66%) are female. On average, male students are 1.9
years older than female students. Later we’ll probably also see whether
gender modifies the relationship between
attitude/learning and exam points.
data %>% group_by(gender) %>%
summarise(
n = sum(!is.na(gender)),
age_mean_sd = paste0(round(mean(age), 1), " (", round(sd(age), 1), ")")
) %>%
mutate(pct = scales::percent(n / sum(n))) %>%
select(gender, n, pct, everything()) %>%
knitr::kable()
| gender | n | pct | age_mean_sd |
|---|---|---|---|
| F | 110 | 66% | 24.9 (7.4) |
| M | 56 | 34% | 26.8 (8.4) |
Predictors of exam points
It is plausible that students’ attitude or learning
strategies influence exam points. Therefore, let’s glance
over these potential relationships:
p <- data %>% pivot_longer(
cols = c(attitude, deep, stra, surf),
values_to = "answer",
names_to = "var"
) %>%
left_join(labs) %>%
ggplot(aes(x = answer, y = points, color = lab))
## Joining with `by = join_by(var)`
p <- p + geom_point(alpha = .5, size = .5)
p <- p + geom_smooth(method = "lm", se = F)
p <- p + coord_cartesian(xlim = c(1, 5))
p <- p + labs(title = "Predictors for good or bad exam success") +
xlab("Average answer for the set of questions") + ylab("Exam points")
p
## `geom_smooth()` using formula = 'y ~ x'
Intuitively and as hypothesized, attitude shows a positive
relationship with exam points, whereas surface
learning may be negatively associated with points. The
first does not, however, imply that attitude is causal for
success, as there could be a confounder (e.g., competence) that
associates with attitude and causes good exam points. The link
between surface learning and poor exam points is surprising, as surface
learning is focused precisely around the exam. However, whether these
associations are statistically significant requires testing!
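Such testing could start with a simple correlation test. A minimal sketch on simulated data (on the real data, `cor.test(data$attitude, data$points)`); the simulated slope and noise level are assumptions, not estimates:

```r
set.seed(2)
# Simulated attitude/points with a built-in positive relationship
attitude <- runif(100, min = 1, max = 5)
points <- 11 + 3.5 * attitude + rnorm(100, sd = 5)
ct <- cor.test(attitude, points)  # Pearson correlation with a t-test
c(estimate = unname(ct$estimate), p = ct$p.value)
```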
Gender differences
It could be hypothesized that attitude, learning strategies, and/or
exam points differ between the genders. Let’s also do some
preliminary hypothesis testing using the Wilcoxon rank-sum test, which
requires minimal assumptions.
# Gender bar plot
plotdata <- data %>%
group_by(gender) %>%
reframe(n = n()) %>%
mutate(
pct = scales::percent(n / sum(n))
)
# Calculate 95% CIs
for (i in 1:nrow(plotdata)) {
plotdata$Lower[i] <- Hmisc::binconf(
x = plotdata$n[i], n = sum(plotdata$n)
)[[2]] * sum(plotdata$n)
plotdata$Upper[i] <- Hmisc::binconf(
x = plotdata$n[i], n = sum(plotdata$n)
)[[3]] * sum(plotdata$n)
}
# Plot
p <- plotdata %>%
ggplot(aes(x = gender, y = n, fill = gender))
p <- p + geom_col(width = .5)
p <- p + geom_errorbar(width = .2, aes(ymin = Lower, ymax = Upper))
p <- p + geom_text(aes(y = Upper, label = paste0(pct, "\n")))
p <- p + fill_gender()
p <- p + scale_y_continuous(expand = expansion(mult = c(0, .1)))
p <- p + labs(title = "Gender", subtitle = "95% CI")
p1 <- p
# Facetted sina plots
p <- data %>% pivot_longer(
cols = colnames(data)[-1],
values_to = "value",
names_to = "var"
) %>%
left_join(labs) %>%
ggplot(aes(y = value, x = gender, color = gender))
p <- p + geom_violin(fill = "white", color = NA, alpha = .1)
p <- p + ggforce::geom_sina(size = .5)
p <- p + geom_boxplot(width = .2, color = "white", outlier.color = NA, alpha = .5)
p <- p + stat_summary(fun = mean, shape = 4, color = "white")
p <- p + facet_wrap(. ~ lab, scales = "free")
p <- p + color_gender()
p <- p + labs(
title = "Potential gender differences",
subtitle = "Wilcoxon rank-sum test"
)
p <- p + ggpubr::stat_compare_means(
method = "wilcox.test", size = 3,
label = "p.format", hjust = .5, aes(x = 1.5)
)
p <- p + scale_y_continuous(expand = expansion(mult = c(0, .15)))
p2 <- p
ggpubr::ggarrange(p1, p2, legend = "none", widths = c(.3, 1), labels = "AUTO")
As observed from the plots, male students are a minority on the course and significantly older than female students. Men have a significantly more positive attitude towards statistics. Strategic learning does not differ between the genders despite a fairly small \(p\) value.
Influence of gender (interactions)
Perhaps gender modifies the relationship between
learning strategies and exam points – let’s entertain this idea (without
making regression models with interactions yet).
# Define independent variables
traits <- tribble(
~var, ~lab, ~unit, ~zoomx,
"age", "age", "Years", c(10, 60),
"attitude", "attitude", "Averaged answer", c(1, 5),
"deep", "deep learning", "Averaged answer", c(1, 5),
"stra", "strategic learning", "Averaged answer", c(1, 5),
"surf", "surface learning", "Averaged answer", c(1, 5)
)
# Plot in a loop
plist <- lapply(1:nrow(traits), function (i) {
p <- data %>%
ggplot(aes(x = .data[[traits$var[i]]], y = points, col = gender, fill = gender))
p <- p + geom_point(size = .5, alpha = .5)
p <- p + geom_smooth(method = "lm")
p <- p + ggtitle(paste0("Student's ", traits$lab[[i]], "\nversus exam points"))
p <- p + coord_cartesian(xlim = traits$zoomx[[i]])
p <- p + xlab(traits$unit[[i]]) + ylab("Exam points")
p <- p + color_gender() + fill_gender()
p
})
# Combine plots into a panel
plist[[1]] + plist[[2]] + plist[[3]] + plist[[4]] + plist[[5]] +
plot_layout(guides = "collect")
The slopes for men and women seem quite similar. If anything, aging might hinder men’s exam success, whereas women do not experience age-related decline in exam points.
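This visual impression corresponds to an age × gender interaction, which could be tested formally (with the real data: `lm(points ~ age * gender, data = data)`). A sketch on simulated data where only the male slope is negative by construction:

```r
set.seed(3)
# Simulated points where age hurts only the "M" group (by construction)
sim <- data.frame(
  gender = rep(c("F", "M"), each = 100),
  age = runif(200, min = 17, max = 55)
)
sim$points <- 25 - 0.3 * sim$age * (sim$gender == "M") + rnorm(200, sd = 4)
m_int <- lm(points ~ age * gender, data = sim)
# The age:genderM row tests whether the slopes differ between genders
coef(summary(m_int))["age:genderM", c("Estimate", "Pr(>|t|)")]
```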
Let’s make a “monster plot” of all the relationships within the data.
# Visualize
# pairs(data[-1])
data %>% GGally::ggpairs(
aes(col = gender, alpha = .3),
lower = list(continuous = "smooth")
) +
color_gender() + fill_gender()
The plot demonstrates that the strongest correlation is observed
between attitude and points. In men, there is
a strong inverse correlation between surface and deep learning, whereas
no such phenomenon exists in women. This implies that women may
employ both strategies simultaneously, whereas in men the two
strategies may be mutually exclusive.
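The differing correlation structure can be quantified by computing the surf–deep correlation within each gender (with the real data, e.g. `sapply(split(data, data$gender), \ (d) cor(d$surf, d$deep))`). A sketch on simulated data where the inverse link exists only in one group by construction:

```r
set.seed(4)
n <- 80
m_surf <- runif(n, 1, 5)
sim <- data.frame(
  gender = rep(c("F", "M"), each = n),
  surf = c(runif(n, 1, 5), m_surf),
  deep = c(runif(n, 1, 5),                           # independent for "F"
           5.5 - 0.8 * m_surf + rnorm(n, sd = 0.5))  # inverse for "M"
)
r <- sapply(split(sim, sim$gender), function(d) cor(d$surf, d$deep))
round(r, 2)  # near zero for "F", strongly negative for "M"
```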
2.3 Regression model
Instructions
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters.
Next I will fit a regression model that looks at the association
between attitude and exam points adjusted for
age and gender.
# Define formula
response <- "points"
IVs <- c("attitude", "age", "gender")
formula <- paste0(response, " ~ ", paste(IVs, collapse = " + "))
# Run initial model
m1 <- lm(formula, data)
m1_s <- summary(m1)
m1_s
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
Instructions
If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
Only attitude is significant in our model. Let’s remove
non-significant predictors one by one and do it efficiently in a
loop.
# Drop useless covariates in a loop
results <- list(); models <- list()
while (max(m1_s$coefficients[-1, "Pr(>|t|)"]) > 0.05) {
# Identify the most non-significant covariate
nonsig <- with(m1_s, rownames(coefficients)[
coefficients[, "Pr(>|t|)"] == max(coefficients[, "Pr(>|t|)"])
]); nonsigF <- nonsig # Save "raw" covariate with factor level
# Remove factor value characters
while (!(nonsig %in% colnames(data))) {
nonsig <- substr(nonsig, 1, nchar(nonsig)-1)
}
# Store non-significant covariate
results[[nonsigF]] <- data.frame(
Dropped = nonsigF,
P = m1_s$coefficients[nonsigF, "Pr(>|t|)"] %>% signif(3)
)
# Remove covariate from model
IVs <- IVs[!(IVs %in% nonsig)]
# Quit if no predictors left
if (length(IVs) == 0) {print("All predictors NS"); break}
if (length(IVs) > 0) {
formula <- paste0(response, " ~ ", paste(IVs, collapse = " + "))
m1 <- lm(formula, data); m1_s <- summary(m1)
}
}
dropped <- bind_rows(results); dropped %>% knitr::kable()
| Dropped | P |
|---|---|
| genderM | 0.720 |
| age | 0.144 |
Both gender and age were dropped one by
one due to their non-significant contribution to the model. Let’s see whether
this decision replicates with the rms::fastbw
function for backward variable selection:
# Run initial model
m_ols <- rms::ols(points ~ attitude + age + gender, data)
# Backward variable selection based on p value
rms::fastbw(m_ols, rule = "p")
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
## gender 0.13 1 0.7192 0.13 1 0.7192 -1.87 0.201
## age 2.15 1 0.1426 2.28 2 0.3201 -1.72 0.191
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept 11.637 1.8288 6.363 1.975e-10
## attitude 3.525 0.5669 6.219 5.010e-10
##
## Factors in Final Model
##
## [1] attitude
Yup, we got the same result – the only predictor left in the model is
attitude. Thus, in the final model we’ll predict exam
points with only attitude.
2.4 Interpretation
Instructions
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
Let’s look at the final model:
# Final model
m1_s
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In this model, the coefficient for attitude is 3.5, which
means that each one-unit increase in attitude corresponds,
on average, to a 3.5-point higher exam score. The intercept means that
if attitude were zero, the predicted exam
points would be 11.6. However, this is only theoretical, as
zero lies outside the Likert scale’s range (1–5).
The R-squared values (both multiple and adjusted ≈ 0.19) show that only a modest share of the variation (19%) in exam scores is explained by attitude. This is understandable, as many other factors likely influence exam success, and our method of measuring attitude may be imperfect.
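Using the estimates printed above, predictions are just the intercept plus the slope times attitude; a quick hand-computation:

```r
# Predicted exam points from the final model's printed estimates
b0 <- 11.6372  # intercept
b1 <- 3.5255   # slope for attitude
pred <- function(attitude) b0 + b1 * attitude
round(c(at_1 = pred(1), at_3 = pred(3), at_5 = pred(5)), 1)
# at_1 = 15.2, at_3 = 22.2, at_5 = 29.3
```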
2.5 Graphical model validation
Instructions
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
Linear regression has a few key assumptions:
- A linear relationship between the predictor x and the response y.
- Normality of the residuals.
- Homoscedasticity: the residuals are assumed to have constant variance.
Let’s see if our assumptions hold.
# Residuals
model.diag.metrics <- broom::augment(m1)
ggplot(model.diag.metrics, aes(attitude, points)) +
geom_segment(
aes(xend = attitude, yend = .fitted),
color = "orangered2", alpha = .7
) +
geom_point() +
stat_smooth(color = "white", method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
# The actual diagnostic plots
m1_diag <- ggplot2::autoplot(m1, colour = "white", size = .5)
m1_diag
- The “Residuals vs Fitted” plot shows a horizontal line, without distinct patterns, which indicates that the predictor and response variables have a linear relationship.
- In the Normal Q-Q plot, standardized residuals roughly follow the dashed line and deviate only at extreme quantiles.
- The Scale-Location plot shows that residuals spread roughly equally along the range of the predictor, indicating that variances are constant.
- The Residuals vs Leverage plot identifies two observations that have standardized residuals greater than absolute value 3, which are possible outliers.
To conclude, the model assumptions hold up quite well.
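The visual checks can be complemented with formal tests; for example, a Shapiro-Wilk test on the residuals probes normality. A minimal sketch on a simulated fit (on the real model, `shapiro.test(resid(m1))`):

```r
set.seed(5)
# A toy linear fit with genuinely normal errors
x <- runif(100, 1, 5)
y <- 11 + 3.5 * x + rnorm(100, sd = 5)
fit <- lm(y ~ x)
shapiro.test(resid(fit))$p.value  # a small p would flag non-normal residuals
```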
# .Rmd file run time
end.time <- Sys.time()
elapsed.time <- (end.time - start.time) %>% round()
elapsed.time
## Time difference of 21 secs
3 Logistic regression
Setup
First let’s load the needed packages and some custom functions.
# Packages
pacman::p_load(tidyverse, GGally, patchwork, ggfortify)
# Run time
start.time <- Sys.time()
3.2 Get data
# Read data
datasets <- list.files("data/alc/ready", full.names = T)
newest <- datasets %>% sub('.*_', '', .) %>% as.Date() %>% max()
data <- datasets[grepl(newest, datasets)] %>%
readr::read_csv(show_col_types = F)
# Glance at the data
# glimpse(data)
finalfit::finalfit_glimpse(data)$Continuous %>% select(-label) %>% knitr::kable()
| var_type | n | missing_n | missing_percent | mean | sd | min | quartile_25 | median | quartile_75 | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 370 | 0 | 0.0 | 16.6 | 1.2 | 15.0 | 16.0 | 17.0 | 17.0 | 22.0 | |
| Medu | 370 | 0 | 0.0 | 2.8 | 1.1 | 0.0 | 2.0 | 3.0 | 4.0 | 4.0 | |
| Fedu | 370 | 0 | 0.0 | 2.6 | 1.1 | 0.0 | 2.0 | 3.0 | 3.8 | 4.0 | |
| traveltime | 370 | 0 | 0.0 | 1.4 | 0.7 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | |
| studytime | 370 | 0 | 0.0 | 2.0 | 0.8 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 | |
| famrel | 370 | 0 | 0.0 | 3.9 | 0.9 | 1.0 | 4.0 | 4.0 | 5.0 | 5.0 | |
| freetime | 370 | 0 | 0.0 | 3.2 | 1.0 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 | |
| goout | 370 | 0 | 0.0 | 3.1 | 1.1 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | |
| Dalc | 370 | 0 | 0.0 | 1.5 | 0.9 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 | |
| Walc | 370 | 0 | 0.0 | 2.3 | 1.3 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 | |
| health | 370 | 0 | 0.0 | 3.6 | 1.4 | 1.0 | 3.0 | 4.0 | 5.0 | 5.0 | |
| failures | 370 | 0 | 0.0 | 0.2 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | |
| absences | 370 | 0 | 0.0 | 4.5 | 5.5 | 0.0 | 1.0 | 3.0 | 6.0 | 45.0 | |
| G1 | 370 | 0 | 0.0 | 11.5 | 2.7 | 2.0 | 10.0 | 12.0 | 14.0 | 18.0 | |
| G2 | 370 | 0 | 0.0 | 11.5 | 2.8 | 4.0 | 10.0 | 12.0 | 14.0 | 18.0 | |
| G3 | 370 | 0 | 0.0 | 11.5 | 3.3 | 0.0 | 10.0 | 12.0 | 14.0 | 18.0 | |
| alc_use | 370 | 0 | 0.0 | 1.9 | 1.0 | 1.0 | 1.0 | 1.5 | 2.5 | 5.0 |
Description
The Student Performance data used for the assignment is
from the open UC
Irvine Machine Learning Repository. Briefly, the data describe
student performance in secondary education of two Portuguese schools.
Variables include student grades, demographic, social and school related
features, based on questionnaire results. The data were processed by
computing grades as average performance in two distinct subjects:
Mathematics (mat) and Portuguese language (por). More metadata of the
two original datasets can be found in
data/alc/docs/student.txt.
3.3 Hypotheses
Instructions
Choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.
I’ll choose the following variables and list each corresponding hypothesis after the variable:
- traveltime: home to school travel time, numeric:
  - 1 - <15 min.
  - 2 - 15 to 30 min.
  - 3 - 30 min. to 1 hour
  - 4 - >1 hour

\(H_1:\) Location, location, location… Students that live in an urban setting (where the school is) might have easier access to bars and hence to alcohol.
- studytime: weekly study time, numeric:
  - 1 - <2 hours
  - 2 - 2 to 5 hours
  - 3 - 5 to 10 hours
  - 4 - >10 hours

\(H_1:\) Students that use a lot of alcohol may prioritize drinking over studying. Students that put a lot of effort into studies may be inclined to use less alcohol.
- failures: number of past class failures, numeric:
  - n if 1<=n<3
  - else 4

\(H_1:\) It is plausible that alcohol use impairs cognitive abilities. Poor success in studies, indicated by failing tests and courses, could be a sign of alcohol abuse.
- romantic: with a romantic relationship, binary:
  - yes
  - no

\(H_1:\) It is well known that couples have a healthier diet than singles - perhaps this also applies to the fourth macronutrient: alcohol.
3.4 Distributions
# Variables and their value labs
vars_hypo <- tribble(
~var, ~labs,
"traveltime", c("<15 min.", "15 to 30 min.", "30 min. to 1 hour", ">1 hour"),
"studytime", c("<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours"),
"failures", c("n=0", "n=1", "n=2", "n=3"),
"romantic", c("Taken", "Single")
)
# Plot in a loop
plist <- lapply(1:nrow(vars_hypo), \ (i) {
p <- data %>%
mutate(
!!vars_hypo$var[[i]] := factor(
.data[[vars_hypo$var[[i]]]],
labels = vars_hypo$labs[[i]]
)
) %>%
ggplot(aes(x = .data[[vars_hypo$var[[i]]]]))
p <- p + geom_bar(aes(fill = .data[[vars_hypo$var[[i]]]]))
p <- p + theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
legend.position = "none"
)
p <- p + harrypotter::scale_fill_hp_d("LunaLovegood")
p
})
ggpubr::ggarrange(
plotlist = plist,
align = "hv"
)
The data are ordinal but not interval, and also heavily skewed. Let’s use the Spearman correlation coefficient, a non-parametric measure that works for (at least) ordinal data.
# Plot
data[, c("alc_use", "high_use", vars_hypo$var)] %>%
GGally::ggpairs(
mapping = aes(color = romantic, fill = romantic), #, alpha = .3
upper = list(continuous = wrap("cor", method = "spearman"))
) +
harrypotter::scale_color_hp_d("LunaLovegood") +
harrypotter::scale_fill_hp_d("LunaLovegood") +
ggtitle("Links with alcohol consumption by relationship status")
Interpretation: We can see that alcohol use is
related to lower studytime (\(r =
-0.276\)) and a higher rate of failures (\(r = 0.204\)), but Spearman correlation
analysis reveals no significant link with traveltime (\(r = 0.084\)). The inverse relationship
between alcohol use and studytime may depend on
romantic status, as a strong, significant correlation
coefficient is seen only in single students (\(r = -0.321\)).
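The Spearman statistic itself comes with a test; a minimal sketch on simulated ordinal data (on the real data, e.g. `cor.test(data$alc_use, data$studytime, method = "spearman")`); the simulated effect size is an assumption:

```r
set.seed(6)
# Simulated 1-5 alcohol use and 1-4 study time with a built-in inverse link
alc <- sample(1:5, 200, replace = TRUE)
study <- pmax(1, pmin(4, round(3 - 0.4 * alc + rnorm(200, sd = 1))))
ct <- cor.test(alc, study, method = "spearman", exact = FALSE)  # rank-based
c(rho = unname(ct$estimate), p = ct$p.value)
```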
3.5 Logistic regression
# Variables
response <- "high_use"
predictors <- vars_hypo$var
form <- as.formula(paste0(response, " ~ ", paste(predictors, collapse = " + ")))
# Fit the model
m1 <- glm(formula = form, data = data)
m1_s <- summary(m1)
m1_s
##
## Call:
## glm(formula = form, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7727 -0.3221 -0.2293 0.5075 0.9569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38023 0.08283 4.590 6.1e-06 ***
## traveltime 0.06816 0.03321 2.052 0.040865 *
## studytime -0.09312 0.02795 -3.331 0.000953 ***
## failures 0.11643 0.04261 2.732 0.006597 **
## romanticyes -0.03282 0.04934 -0.665 0.506412
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1959533)
##
## Null deviance: 77.700 on 369 degrees of freedom
## Residual deviance: 71.523 on 365 degrees of freedom
## AIC: 453.93
##
## Number of Fisher Scoring iterations: 2
# ORs and their CIs
CIs <- data.frame(
OR = coef(m1) %>% exp() %>% round(3), # Odds ratios
Lower = confint(m1)[, "2.5 %"] %>% exp() %>% round(3), # 95% CIs
Upper = confint(m1)[, "97.5 %"] %>% exp() %>% round(3), # 95% CIs
P = m1_s$coefficients[, "Pr(>|t|)"] %>% gtsummary::style_pvalue(prepend_p = T)
) %>% rownames_to_column(var = "Predictor")
# Plot estimates
p <- CIs %>%
filter(Predictor != "(Intercept)") %>%
ggplot(aes(x = OR, y = Predictor))
p <- p + geom_vline(xintercept = 1)
p <- p + geom_pointrange(aes(xmin = Lower, xmax = Upper), size = 1)
p <- p + geom_text(aes(
label = paste0(P, "\n\n")
))
p <- p + theme(
panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()
)
p <- p + scale_x_continuous(trans = "log2", breaks = c(.8, .9, 1, 1.1, 1.2))
p <- p + ggtitle("Forest plot") + xlab("OR for high alcohol use (95% CI)") + ylab("")
p
# Numeric output
CIs %>% knitr::kable()
| Predictor | OR | Lower | Upper | P |
|---|---|---|---|---|
| (Intercept) | 1.463 | 1.243 | 1.720 | p<0.001 |
| traveltime | 1.071 | 1.003 | 1.143 | p=0.041 |
| studytime | 0.911 | 0.863 | 0.962 | p<0.001 |
| failures | 1.123 | 1.033 | 1.221 | p=0.007 |
| romanticyes | 0.968 | 0.879 | 1.066 | p=0.5 |
Interpretation: Odds ratios measure the association between exposure(s) and the outcome (high alcohol use). An OR greater than 1 indicates that exposed individuals are more likely to be high alcohol users. An OR smaller than 1 suggests that the exposure is more common among low alcohol users. If the 95% confidence interval overlaps 1 (the null), we cannot reject the null hypothesis \(H_0\) at significance level \(\alpha = 0.05\).
As we see from the forest plot, and counter to our hypothesis, longer
traveltime is borderline significantly associated with high
alcohol use. As hypothesized above, studytime shows an
inverse association with high alcohol use and failures
a positive association.
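One caveat: `glm()` defaults to the gaussian family, and the summary above indeed reports a gaussian dispersion parameter, so the fit is effectively a linear probability model. A conventional logistic regression adds `family = binomial`, after which exponentiated coefficients are genuine odds ratios. A minimal sketch on simulated data (the true effect size below is an assumption):

```r
set.seed(7)
n <- 300
studytime <- sample(1:4, n, replace = TRUE)
# True model: log-odds of high use drop by 0.5 per study-time step
high_use <- rbinom(n, 1, plogis(0.5 - 0.5 * studytime))
m_log <- glm(high_use ~ studytime, family = binomial)
exp(coef(m_log))["studytime"]  # OR per one-step increase, below 1 here
```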
3.6 Predictive power
# Significant variables
sig_vars <- rownames(m1_s$coefficients)[which(m1_s$coefficients[, "Pr(>|t|)"] < .05)][-1]
# Fit the model
form <- as.formula(paste0(response, " ~ ", paste(sig_vars, collapse = " + ")))
m1 <- glm(formula = form, data = data)
m1_s <- summary(m1)
m1_s
##
## Call:
## glm(formula = form, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7595 -0.3201 -0.2519 0.5178 0.9359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37144 0.08171 4.546 7.45e-06 ***
## traveltime 0.06820 0.03319 2.055 0.040590 *
## studytime -0.09389 0.02791 -3.365 0.000848 ***
## failures 0.11517 0.04254 2.707 0.007100 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1956548)
##
## Null deviance: 77.70 on 369 degrees of freedom
## Residual deviance: 71.61 on 366 degrees of freedom
## AIC: 452.37
##
## Number of Fisher Scoring iterations: 2
# Prediction and observing the data
prob <- predict(m1, type = "response")
data <- mutate(data, probability = prob)
data <- mutate(data, prediction = probability > 0.5)
# 2x2 cross tabulation
data %>%
{table(IRL = .$high_use, prediction = .$prediction)}
## prediction
## IRL FALSE TRUE
## FALSE 250 9
## TRUE 94 17
# Table the target variable versus the predictions
data %>%
{table(IRL_use = .$high_use, predicted_use = .$prediction)} %>%
prop.table() %>% addmargins() %>% round(2)
## predicted_use
## IRL_use FALSE TRUE Sum
## FALSE 0.68 0.02 0.70
## TRUE 0.25 0.05 0.30
## Sum 0.93 0.07 1.00
# Average number of wrong predictions in the (training) data
loss_func <- function (class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Call loss_func
loss_func(class = data$high_use, prob = 0)
## [1] 0.3
loss_func(class = data$high_use, prob = 1)
## [1] 0.7
data %>% {loss_func(class = .$high_use, prob = .$probability)}
## [1] 0.2783784
## Compute AUC metrics
# Our model
perf_m1 <- ROCR::prediction(prob, data$high_use) %>%
ROCR::performance(measure = "tpr", x.measure = "fpr")
# Simple guessing strategy: one random "probability" per observation
guess_fun <- \ (i) {
runif(length(i), min = 0, max = 1)
}
guessings <- guess_fun(m1$residuals)
perf_guess <- ROCR::prediction(guessings, data$high_use) %>%
ROCR::performance(measure = "tpr", x.measure = "fpr")
# Plot ROC curve
p <- ggplot()
p <- p + geom_abline(slope = 1, intercept = 0, color = "gray80")
p <- p + geom_line(
linetype = "dashed", size = 1,
aes(x = perf_m1@x.values[[1]], y = perf_m1@y.values[[1]], color = "Model")
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
p <- p + geom_line(
linetype = "dashed", size = 1,
aes(x = perf_guess@x.values[[1]], y = perf_guess@y.values[[1]], color = "Guess")
)
p <- p + xlab(perf_m1@x.name) + ylab(perf_m1@y.name)
p <- p + scale_color_manual(name = "Model", values = c("orangered", "orange"))
p
Our model produced 25% false negatives and 2% false positives. There were 5% true positives and 68% true negatives, indicating good specificity but poor sensitivity. Although the model is far from perfect, it performs much better than a simple guessing strategy (flipping a coin), as demonstrated by the ROC curves.
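The percentages can be turned into standard classification metrics directly from the 2×2 table printed above:

```r
# Counts from the cross tabulation above (rows: truth, cols: prediction)
tn <- 250; fp <- 9; fn <- 94; tp <- 17
metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),  # share of true high users caught
  specificity = tn / (tn + fp)   # share of non-high users correctly cleared
)
round(metrics, 2)  # accuracy 0.72, sensitivity 0.15, specificity 0.97
```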
3.7 Bonus: 10-fold cross-validation
# Perform 10-fold CV
# Define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# K-fold cross-validation
cv <- boot::cv.glm(
data = data,
cost = loss_func,
glmfit = m1,
K = 10
)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2891892
The average percentage of wrong predictions in the cross validation
was ~30%, which is pretty high and higher than in the
Exercise3.Rmd set (26%). We might not have the best
variables or enough of them in the model.
Let’s try making a model that has smaller prediction error.
# Lets try this one variable at a time
vars <- names(data)[!(names(data) %in% c(
"alc_use", "high_use", "Dalc", "Walc", "probability", "prediction"
))]
# Most important vars
form <- as.formula(paste0("high_use ~ ", paste(vars, collapse = " + ")))
m <- glm(formula = form, family = "binomial", data = data)
# m %>% summary()
# Variable Importance
importance <- caret::varImp(m, scale = FALSE) %>%
arrange(Overall) %>%
rownames_to_column("Variable")
importance %>%
ggplot(aes(x = reorder(Variable, Overall), y = Overall)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
# Loop
results <- list()
for (var in vars) {
form <- as.formula(paste0("high_use ~ ", var))
m_s <- glm(formula = form, family = "binomial", data = data) %>% summary()
results[[var]] <- data.frame(
var = var,
P = m_s$coefficients[2, "Pr(>|z|)"]
)
}
results <- bind_rows(results) %>% arrange(P)
results %>% knitr::kable()
| var | P |
|---|---|
| goout | 0.0000000 |
| studytime | 0.0000566 |
| sex | 0.0000867 |
| absences | 0.0001296 |
| failures | 0.0006145 |
| freetime | 0.0020304 |
| G1 | 0.0037001 |
| G2 | 0.0075765 |
| traveltime | 0.0086875 |
| G3 | 0.0121759 |
| famrel | 0.0204223 |
| higher | 0.0255062 |
| age | 0.0437592 |
| address | 0.0674297 |
| health | 0.1343485 |
| famsize | 0.1442606 |
| nursery | 0.2086710 |
| activities | 0.2294490 |
| reason | 0.2619576 |
| schoolsup | 0.3676697 |
| guardian | 0.3960535 |
| school | 0.3968021 |
| Fjob | 0.4182266 |
| famsup | 0.4397546 |
| internet | 0.5098542 |
| romantic | 0.5121903 |
| paid | 0.5246335 |
| Pstatus | 0.8226210 |
| Mjob | 0.8632720 |
| Fedu | 0.9001225 |
| Medu | 0.9331558 |
# Pick most promising ones
form <- as.formula(paste0("high_use ~ ", paste(results$var[1:5], collapse = " + ")))
m1 <- glm(formula = form, family = "binomial", data = data)
m_s <- summary(m1)
m_s
##
## Call:
## glm(formula = form, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1901 -0.7637 -0.4999 0.7560 2.4494
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.28866 0.60304 -5.453 4.94e-08 ***
## goout 0.70261 0.12135 5.790 7.03e-09 ***
## studytime -0.37986 0.17479 -2.173 0.029757 *
## sexM 0.82696 0.27173 3.043 0.002340 **
## absences 0.07747 0.02272 3.410 0.000649 ***
## failures 0.38962 0.23608 1.650 0.098860 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 364.54 on 364 degrees of freedom
## AIC: 376.54
##
## Number of Fisher Scoring iterations: 4
# Backward variable selection
step_m <- MASS::stepAIC(m1, direction = "both", trace = F)
step_m %>% summary()
##
## Call:
## glm(formula = high_use ~ goout + studytime + sex + absences +
## failures, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1901 -0.7637 -0.4999 0.7560 2.4494
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.28866 0.60304 -5.453 4.94e-08 ***
## goout 0.70261 0.12135 5.790 7.03e-09 ***
## studytime -0.37986 0.17479 -2.173 0.029757 *
## sexM 0.82696 0.27173 3.043 0.002340 **
## absences 0.07747 0.02272 3.410 0.000649 ***
## failures 0.38962 0.23608 1.650 0.098860 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 364.54 on 364 degrees of freedom
## AIC: 376.54
##
## Number of Fisher Scoring iterations: 4
# Perform 10-fold CV
cv <- boot::cv.glm(
data = data,
cost = loss_func,
glmfit = step_m,
K = 10
)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2189189
Now we obtained a smaller 10-fold cross-validation prediction error than in the Exercise Set (0.22 vs. 0.26)!
3.8 Super-Bonus: different sets of predictors
Let's find the optimal number of predictors that minimizes the prediction error.
# In a loop based on importance
importance <- caret::varImp(m, scale = FALSE) %>%
arrange(Overall) %>%
rownames_to_column("Variable")
# Remove factor levels from tailing the column names
vars <- c()
for (var in importance$Variable) {
while (!(var %in% names(data))) {
var <- stringr::str_sub(var, end = -2)
}
vars <- c(vars, var)
}
# The actual loop
results <- list()
while (length(vars) >= 2) {
vars <- tail(vars, -1)
cat(paste0(length(vars)), " ")
# Formula
form <- as.formula(paste0("high_use ~ ", paste(vars, collapse = " + ")))
# Model
loop_m <- glm(formula = form, family = "binomial", data = data)
# Perform 10-fold CV
cv <- boot::cv.glm(
data = data,
cost = loss_func,
glmfit = loop_m,
K = 10
)
# Pick relevant stuff into a row
results[[length(vars)]] <- data.frame(
N_variables = length(vars),
Avg_wrong_preds = cv$delta[1] # Avg N of wrong predictions
)
}
## 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
results <- bind_rows(results)# %>% arrange(P)
results %>% ggplot(aes(x = N_variables, y = Avg_wrong_preds)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We have now demonstrated that the lowest prediction error is achieved by including roughly 10 of the most important variables!
# .Rmd file run time
end.time <- Sys.time()
elapsed.time <- (end.time - start.time) %>% round()
elapsed.time
## Time difference of 17 secs
4 Clustering and classification
Setup
First let’s load the needed packages and some custom functions.
# Packages
pacman::p_load(tidyverse)
# Custom functions
source("src/fig_setup.R")
# Run time
start.time <- Sys.time()
4.2 Get data
Instructions
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. (0-1 points)
Save the data to an object and check its structure
# Read data
data <- MASS::Boston
# Glance at the data
finalfit::finalfit_glimpse(data)$Continuous %>%
select(-label) %>%
knitr::kable()
| var_type | n | missing_n | missing_percent | mean | sd | min | quartile_25 | median | quartile_75 | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 0.0 | 3.6 | 8.6 | 0.0 | 0.1 | 0.3 | 3.7 | 89.0 | |
| zn | 506 | 0 | 0.0 | 11.4 | 23.3 | 0.0 | 0.0 | 0.0 | 12.5 | 100.0 | |
| indus | 506 | 0 | 0.0 | 11.1 | 6.9 | 0.5 | 5.2 | 9.7 | 18.1 | 27.7 | |
| chas | 506 | 0 | 0.0 | 0.1 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | |
| nox | 506 | 0 | 0.0 | 0.6 | 0.1 | 0.4 | 0.4 | 0.5 | 0.6 | 0.9 | |
| rm | 506 | 0 | 0.0 | 6.3 | 0.7 | 3.6 | 5.9 | 6.2 | 6.6 | 8.8 | |
| age | 506 | 0 | 0.0 | 68.6 | 28.1 | 2.9 | 45.0 | 77.5 | 94.1 | 100.0 | |
| dis | 506 | 0 | 0.0 | 3.8 | 2.1 | 1.1 | 2.1 | 3.2 | 5.2 | 12.1 | |
| rad | 506 | 0 | 0.0 | 9.5 | 8.7 | 1.0 | 4.0 | 5.0 | 24.0 | 24.0 | |
| tax | 506 | 0 | 0.0 | 408.2 | 168.5 | 187.0 | 279.0 | 330.0 | 666.0 | 711.0 | |
| ptratio | 506 | 0 | 0.0 | 18.5 | 2.2 | 12.6 | 17.4 | 19.1 | 20.2 | 22.0 | |
| black | 506 | 0 | 0.0 | 356.7 | 91.3 | 0.3 | 375.4 | 391.4 | 396.2 | 396.9 | |
| lstat | 506 | 0 | 0.0 | 12.7 | 7.1 | 1.7 | 6.9 | 11.4 | 17.0 | 38.0 | |
| medv | 506 | 0 | 0.0 | 22.5 | 9.2 | 5.0 | 17.0 | 21.2 | 25.0 | 50.0 |
# Uncomment for data description
# ?MASS::Boston
The Boston
data contains information about suburbs in Boston, Massachusetts. It is
from the MASS package and has 506 rows and 14 columns.
Briefly, the data describe the following variables:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
Source
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
4.3 Visualize
Instructions
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
Let’s start by exploring our data.
# "chas" to nominal
# data <- data %>%
# mutate(
# chas = case_when(
# chas == 1 ~ "Bounds river",
# T ~ "Does not bound river"
# )
# )
# Plot
data %>%
GGally::ggpairs(
lower = list(continuous = "points", size = .01),
upper = list(continuous = GGally::wrap("cor", method = "spearman")),
)
Distributions
The distributions of crim, zn,
age, dis, ptratio,
black and lstat are heavily skewed, so applying
parametric statistics to them is questionable without
transformations (such as Box-Cox or ordernorm). The variables
indus, rad and tax have
multimodal distributions with two distinct peaks. The variable
chas is a dichotomous nominal variable coded as 0 and
1.
Relationships
The majority of the variables are intercorrelated. For example, there
is a strong positive association between nox and
crim (\(\rho = 0.82\)), while
lstat and medv are inversely correlated (\(\rho = -0.85\)).
4.4 Standardize
Instructions
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
Let’s scale the data so that every variable has mean \(\overline{x} = 0\) and standard deviation \(\sigma = 1\).
# Standardize
data_z <- scale(data) %>% data.frame()
# Cut into quartiles
quant <- 4
data_z <- data_z %>% mutate(
crime_quant = gtools::quantcut(crim, q = 4, labels = paste0("Q", 1:quant))
) %>% select(-crim) # Remove original "crim"
# Summaries
summs <- finalfit::finalfit_glimpse(data_z)
summs[[1]] %>% knitr::kable()
| label | var_type | n | missing_n | missing_percent | mean | sd | min | quartile_25 | median | quartile_75 | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zn | zn | 506 | 0 | 0.0 | 0.0 | 1.0 | -0.5 | -0.5 | -0.5 | 0.0 | 3.8 | |
| indus | indus | 506 | 0 | 0.0 | 0.0 | 1.0 | -1.6 | -0.9 | -0.2 | 1.0 | 2.4 | |
| chas | chas | 506 | 0 | 0.0 | -0.0 | 1.0 | -0.3 | -0.3 | -0.3 | -0.3 | 3.7 | |
| nox | nox | 506 | 0 | 0.0 | -0.0 | 1.0 | -1.5 | -0.9 | -0.1 | 0.6 | 2.7 | |
| rm | rm | 506 | 0 | 0.0 | -0.0 | 1.0 | -3.9 | -0.6 | -0.1 | 0.5 | 3.6 | |
| age | age | 506 | 0 | 0.0 | -0.0 | 1.0 | -2.3 | -0.8 | 0.3 | 0.9 | 1.1 | |
| dis | dis | 506 | 0 | 0.0 | 0.0 | 1.0 | -1.3 | -0.8 | -0.3 | 0.7 | 4.0 | |
| rad | rad | 506 | 0 | 0.0 | 0.0 | 1.0 | -1.0 | -0.6 | -0.5 | 1.7 | 1.7 | |
| tax | tax | 506 | 0 | 0.0 | 0.0 | 1.0 | -1.3 | -0.8 | -0.5 | 1.5 | 1.8 | |
| ptratio | ptratio | 506 | 0 | 0.0 | -0.0 | 1.0 | -2.7 | -0.5 | 0.3 | 0.8 | 1.6 | |
| black | black | 506 | 0 | 0.0 | -0.0 | 1.0 | -3.9 | 0.2 | 0.4 | 0.4 | 0.4 | |
| lstat | lstat | 506 | 0 | 0.0 | -0.0 | 1.0 | -1.5 | -0.8 | -0.2 | 0.6 | 3.5 | |
| medv | medv | 506 | 0 | 0.0 | -0.0 | 1.0 | -1.9 | -0.6 | -0.1 | 0.3 | 3.0 |
summs[[2]] %>% knitr::kable()
| label | var_type | n | missing_n | missing_percent | levels_n | levels | levels_count | levels_percent | |
|---|---|---|---|---|---|---|---|---|---|
| crime_quant | crime_quant | 506 | 0 | 0.0 | 4 | “Q1”, “Q2”, “Q3”, “Q4”, “(Missing)” | 127, 126, 126, 127 | 25, 25, 25, 25 |
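As a sanity check, `scale()` simply subtracts each column's mean and divides by its standard deviation; we can verify this for one variable (using the `data` and `data_z` objects defined above):

```r
# scale() is (x - mean(x)) / sd(x), applied column by column
x <- data$tax
all.equal((x - mean(x)) / sd(x), data_z$tax)  # should be TRUE
```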
Let’s split the data into a train set (80%) and a test set (20%).
# N of rows in the data
n <- nrow(data_z)
# Choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# Create train and test sets
train <- data_z[ind, ]
test <- data_z[-ind, ]
4.5 Discriminant analysis
Instructions
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
# Fit the LDA analysis on train set
lda.fit <- MASS::lda(crime_quant ~ ., data = train)
lda.fit
## Call:
## lda(crime_quant ~ ., data = train)
##
## Prior probabilities of groups:
## Q1 Q2 Q3 Q4
## 0.2524752 0.2574257 0.2475248 0.2425743
##
## Group means:
## zn indus chas nox rm age
## Q1 0.9642770 -0.9356567 -0.11793298 -0.8969225 0.41623500 -0.9098477
## Q2 -0.0943372 -0.3352807 0.03052480 -0.6130790 -0.09178563 -0.4428686
## Q3 -0.3826200 0.2025874 0.20012296 0.3776822 0.11580363 0.3821859
## Q4 -0.4872402 1.0171960 -0.07145661 1.0967657 -0.45778073 0.8228338
## dis rad tax ptratio black lstat
## Q1 0.9353872 -0.6823690 -0.7355795 -0.46581993 0.37996454 -0.767410007
## Q2 0.3977361 -0.5478832 -0.5048342 -0.06384729 0.31617128 -0.197286142
## Q3 -0.3736429 -0.4478340 -0.3267954 -0.27554208 0.06319927 -0.001829314
## Q4 -0.8478902 1.6373367 1.5134896 0.77985517 -0.82883991 0.932234071
## medv
## Q1 0.52334649
## Q2 0.05372525
## Q3 0.18007774
## Q4 -0.71319590
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09786646 0.55969291 -1.05858096
## indus 0.11509787 -0.34517546 0.22964273
## chas -0.11699572 -0.05222132 0.15799773
## nox 0.33700237 -0.88350161 -1.22626146
## rm -0.11719656 -0.17791076 -0.14097512
## age 0.14865726 -0.29061395 -0.25189747
## dis -0.05989425 -0.19391922 0.16023303
## rad 3.66413919 0.99821098 -0.09095403
## tax 0.01880921 0.12744457 0.56486490
## ptratio 0.10154734 -0.10443481 -0.22264913
## black -0.08629068 0.02952204 0.09918057
## lstat 0.25306375 -0.21155364 0.37879821
## medv 0.21330868 -0.38073605 -0.12592471
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9549 0.0345 0.0106
# Draw the LDA biplot
p <- ggord::ggord(
lda.fit,
train$crime_quant,
size = 1,
labcol = "white",
veccol = "white",
repel = T
)
# p <- p + ggsci::scale_color_lancet()
# p <- p + ggsci::scale_fill_lancet()
pal <- "PonyoMedium"
p <- p + ghibli::scale_color_ghibli_d(pal)
p <- p + ghibli::scale_fill_ghibli_d(pal)
p <- p + theme_darkmode()
# Print without white edges
grid::grid.newpage()
grid::grid.draw(grid::rectGrob(gp = grid::gpar(fill = backg)))
print(p, newpage = FALSE)
4.6 Class prediction
Instructions
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
# Save the correct classes from test data
correct_classes <- test$crime_quant
# Predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
ctab <- table(correct = correct_classes, predicted = lda.pred$class)
ctab; ctab %>% prop.table() %>% round(2) #%>% {. * 100}
## predicted
## correct Q1 Q2 Q3 Q4
## Q1 18 4 3 0
## Q2 3 13 6 0
## Q3 0 5 18 3
## Q4 0 0 0 29
## predicted
## correct Q1 Q2 Q3 Q4
## Q1 0.18 0.04 0.03 0.00
## Q2 0.03 0.13 0.06 0.00
## Q3 0.00 0.05 0.18 0.03
## Q4 0.00 0.00 0.00 0.28
The confusion matrix above shows that most classes were predicted correctly. Predictions for Q1 suburbs were spread across three different quartiles, whereas all Q4 suburbs were correctly classified in this run. There were some false positives as well as false negatives.
# Overall model accuracy
accuracy <- mean(lda.pred$class == correct_classes) %>% scales::percent()
accuracy
## [1] "76%"
# Misclassification rate
misclass <- mean(lda.pred$class != correct_classes) %>% scales::percent()
misclass
## [1] "24%"
Overall, the model accuracy is 76% and the misclassification rate 24%. Since the training and test sets are randomly sampled, the results change every time the code is run; cross-validation would give a more reliable basis for evaluating the model performance.
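A cheap way to smooth out the randomness of a single split is to repeat the split/fit/predict cycle and average the accuracies. A sketch, assuming the `data_z` object from above (the number of repetitions is arbitrary):

```r
# Average LDA accuracy over 20 random 80/20 splits
accs <- replicate(20, {
  ind <- sample(nrow(data_z), size = nrow(data_z) * 0.8)
  fit <- MASS::lda(crime_quant ~ ., data = data_z[ind, ])
  pred <- predict(fit, newdata = data_z[-ind, ])
  mean(pred$class == data_z$crime_quant[-ind])
})
mean(accs)
```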
4.7 Clustering
Instructions
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
Let’s scale the dataset and calculate Euclidean distances between the observations.
# Reload the dataset
boston <- MASS::Boston
# Standardize
boston_z <- boston %>% scale() %>% data.frame()
# Calculate distances between observations
dist_eu <- dist(boston_z)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
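By default, `dist()` computes Euclidean distances, i.e. \(\sqrt{\sum_i (x_i - y_i)^2}\). We can verify this for the first two observations of `boston_z`:

```r
# Manual Euclidean distance between rows 1 and 2 vs. the dist() result
d_manual <- sqrt(sum((boston_z[1, ] - boston_z[2, ])^2))
d_dist <- as.matrix(dist_eu)[1, 2]
all.equal(d_manual, d_dist)  # should be TRUE
```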
# Determine max N of clusters
k_max <- 10
# Calculate total within-cluster sum of squares (TWSS) for different cluster numbers
# Uncomment to test
# TWSS <- sapply(1:k_max, \ (k) kmeans(boston_z, centers = k)$tot.withinss)
# Iterate in a loop to acquire more precise estimates
results <- list()
for (i in 1:50) {
TWSS <- sapply(1:k_max, \ (k) kmeans(boston_z, centers = k)$tot.withinss)
results[[i]] <- data.frame(iteration = i, N = 1:length(TWSS), TWSS)
}
results <- results %>% bind_rows()
# Visualize
results %>%
ggplot(aes(x = N, y = TWSS, color = iteration)) +
geom_point(size = .2) +
geom_line(alpha = .2, aes(group = iteration), linewidth = .2) +
# stat_summary(fun = mean, geom = "line") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .2) +
scale_x_continuous(breaks = 1:k_max) +
xlab("Number of clusters") +
ylab("Total of Within Cluster Sum of Squares") +
ggtitle("Finding the optimal number of clusters") +
scale_color_gradientn(colours = c("orange", "orangered", "purple"))
Based on 50 simulations, the most dramatic drop in TWSS occurs
when increasing the number of clusters from 1 to
2; after that there are only diminishing returns.
Thus, we will continue with 2 clusters.
# Calculate k-means with optimal N of clusters
km <- kmeans(boston_z, centers = 2)
# Add clusters to data frame
boston_z <- boston_z %>%
mutate(
cluster = factor(km$cluster)
)
# Plot
boston_z %>%
reshape2::melt(id.vars = "cluster") %>%
ggplot(aes(x = cluster, y = value, color = cluster)) +
ggforce::geom_sina(size = .5, alpha = .7) +
stat_summary(
fun.data = mean_cl_normal,
color = "white", geom = "errorbar", width = .2
) +
ggpubr::stat_compare_means(
vjust = 1, method = "wilcox.test",
aes(x = 1.5, label = paste0("P=", after_stat(p.adj)))
) +
theme(
panel.grid.major.x = element_blank()
) +
scale_color_manual(values = c("orange", "orangered2")) +
xlab("Cluster") + ylab("Standard deviations") +
facet_wrap(. ~ variable, ncol = 7)
The clearest separation between the clusters can be seen in the
variables nox, tax, crim and
indus. P-values for the between-cluster differences are
vanishingly small, which highlights the power of k-means clustering. A
full scatter-plot matrix was also made with GGally::ggpairs(), but
it was hard to read, and narrowing it down could omit important
information.
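An alternative to the elbow heuristic used above is the average silhouette width, where higher values indicate better-separated clusters. A sketch, assuming the cluster package is installed and reusing `boston_z` and `dist_eu` from above (the `cluster` column is excluded since it is not a feature):

```r
# Average silhouette width for k = 2..6 clusters
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(boston_z[, names(boston_z) != "cluster"], centers = k)
  sil <- cluster::silhouette(km$cluster, dist_eu)
  mean(sil[, "sil_width"])
})
avg_sil  # higher is better
```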
Bonus
Instructions
Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)
Let’s pick 4 clusters, since adding more clusters decreases TWSS only marginally.
# Reload and standardize
boston_z <- MASS::Boston %>% scale() %>% data.frame()
set.seed(14)
# Calculate k-means with optimal N of clusters
boston_z <- boston_z %>%
mutate(cluster = factor(kmeans(boston_z, centers = 4)$cluster))
# Fit the LDA analysis
lda.fit <- MASS::lda(cluster ~ ., data = boston_z)
lda.fit
## Call:
## lda(cluster ~ ., data = boston_z)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2075099 0.3913043 0.1521739 0.2490119
##
## Group means:
## crim zn indus chas nox rm age
## 1 -0.4059307 1.5055258 -1.0769427 -0.04735191 -0.9382785 1.0056332 -1.0925806
## 2 -0.3887465 -0.3031717 -0.5295044 -0.27232907 -0.4668682 -0.1554981 -0.2169976
## 3 -0.2331330 -0.4761033 1.1327425 1.21047492 0.8708781 -0.2446114 0.8102252
## 4 1.0916331 -0.4872402 1.0372990 -0.27232907 0.9833456 -0.4441889 0.7563424
## dis rad tax ptratio black lstat
## 1 1.1031705 -0.5979551 -0.669170174 -0.8342056 0.35324020 -0.9616973
## 2 0.2322771 -0.5828079 -0.619860360 0.0669631 0.31917567 -0.2126106
## 3 -0.7610345 -0.3420110 -0.003179451 -0.3519844 0.01171309 0.3362920
## 4 -0.8192391 1.6231436 1.533651089 0.8050452 -0.80308660 0.9300049
## medv
## 1 1.042517945
## 2 -0.049475198
## 3 -0.009921371
## 4 -0.784955155
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.0489406106 0.065864378 -0.14067938
## zn -0.0066409931 0.249556205 -1.27609025
## indus 1.3045650331 -1.693333056 -0.90819925
## chas -0.0001958561 -0.931387714 -0.33512520
## nox -0.2349480303 -0.467801127 -0.48206161
## rm -0.1021440667 0.275749340 -0.46694662
## age 0.2251062010 -0.432131803 0.23856949
## dis -0.1624300222 -0.275771504 -0.34799011
## rad 2.4759323083 1.612168328 0.14777848
## tax 0.1436252086 0.287632962 -0.21979788
## ptratio -0.1601500525 0.088820580 0.13309427
## black -0.0960387308 -0.004056644 0.04242377
## lstat 0.1018658127 0.221159432 -0.37295017
## medv -0.2670581604 -0.113293007 -0.63305675
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7251 0.1927 0.0822
# Draw the LDA biplot
p <- ggord::ggord(
lda.fit,
boston_z$cluster,
size = 1,
labcol = "white",
veccol = "white",
repel = T
)
p <- p + ggsci::scale_color_lancet()
p <- p + ggsci::scale_fill_lancet()
p <- p + theme_darkmode()
# Print without white edges
grid::grid.newpage()
grid::grid.draw(grid::rectGrob(gp = grid::gpar(fill = backg)))
print(p, newpage = FALSE)
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
It appears that rad,
indus and zn are most often the most influential linear
separators for the clusters.
Super-Bonus
Instructions
Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
# Use old train set that was analysed above
model_predictors <- train %>% select(-crime_quant)
# LDA model
lda.fit <- MASS::lda(crime_quant ~ ., data = train)
# Check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# Matrix multiplication
matrix_product <- (as.matrix(model_predictors) %*% lda.fit$scaling) %>%
data.frame()
Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)
Let’s first draw the interactive 3D plot with the crime quartile as the response variable.
# Crime quartile as the response variable
plotly::plot_ly(
x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type = 'scatter3d',
mode = 'markers',
color = train$crime_quant,
colors = c("orange", "orangered", "purple", "dodgerblue2")
) %>%
plotly::layout(
font = list(color = 'white'),
title = 'Crime quartile as the response variable',
scene = list(
xaxis = list(title = 'LD1'),
yaxis = list(title = 'LD2'),
zaxis = list(title = 'LD3')
),
paper_bgcolor = backg
)
Then let’s change the response variable to the k-means clusters obtained earlier in this assignment.
# Join
km_data <- train %>%
rownames_to_column("id") %>%
left_join(
boston_z %>% rownames_to_column("id") %>% select(id, cluster),
by = "id"
) %>% select(-id)
# K-mean cluster as the response variable
plotly::plot_ly(
x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type = 'scatter3d',
mode = 'markers',
color = km_data$cluster,
colors = c("orange", "orangered", "purple", "dodgerblue2")
) %>%
plotly::layout(
font = list(color = 'white'),
title = 'K-mean cluster as the response variable',
scene = list(
xaxis = list(title = 'LD1'),
yaxis = list(title = 'LD2'),
zaxis = list(title = 'LD3')
),
paper_bgcolor = backg
)
The 3D plots look almost identical irrespective of the grouping
(color) variable (crime quartile or k-means cluster). There are only
minor differences, such as a couple of blue dots in the purple cluster
when the k-means clusters define the colors. Fascinating! The most
plausible explanation for the strikingly similar color patterns is that
the crim variable is an important linear separator for the
clusters and/or is intercorrelated with various other variables in the
dataset.
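The intercorrelation explanation is easy to check directly from the correlation matrix of the original data. A quick sketch (self-contained; uses the raw Boston data rather than the scaled copy, which does not affect correlations):

```r
# Correlations of crim with every other variable, strongest first
cors <- cor(MASS::Boston)["crim", ]
sort(cors[names(cors) != "crim"], decreasing = TRUE)
```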
# Run time
end.time <- Sys.time()
elapsed.time <- (end.time - start.time) %>% round()
elapsed.time
## Time difference of 26 secs
5 Dimensionality reduction techniques
Setup
# Packages
pacman::p_load(tidyverse)
# Custom functions
source("src/fig_setup.R")
# Run time
start.time <- Sys.time()
5.0 Get data
Automatically choose the newest version of the wrangled
human dataset.
# Read the most up-to-date data
datasets <- list.files("data/human/ready", full.names = T)
newest <- datasets %>% sub('.*_', '', .) %>% as.Date() %>% max()
data <- datasets[grepl(newest, datasets)] %>%
readr::read_csv(show_col_types = F)
5.1 Inspect
Numerical and graphical overview:
# Country to row names
data <- data %>%
column_to_rownames("Country")
# Glance at the data
finalfit::finalfit_glimpse(data)$Continuous %>%
select(-label) %>%
knitr::kable()
| var_type | n | missing_n | missing_percent | mean | sd | min | quartile_25 | median | quartile_75 | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Edu2.FM | 155 | 0 | 0.0 | 0.9 | 0.2 | 0.2 | 0.7 | 0.9 | 1.0 | 1.5 | |
| Labo.FM | 155 | 0 | 0.0 | 0.7 | 0.2 | 0.2 | 0.6 | 0.8 | 0.9 | 1.0 | |
| Edu.Exp | 155 | 0 | 0.0 | 13.2 | 2.8 | 5.4 | 11.2 | 13.5 | 15.2 | 20.2 | |
| Life.Exp | 155 | 0 | 0.0 | 71.7 | 8.3 | 49.0 | 66.3 | 74.2 | 77.2 | 83.5 | |
| GNI | 155 | 0 | 0.0 | 17627.9 | 18543.9 | 581.0 | 4197.5 | 12040.0 | 24512.0 | 123124.0 | |
| Mat.Mor | 155 | 0 | 0.0 | 149.1 | 211.8 | 1.0 | 11.5 | 49.0 | 190.0 | 1100.0 | |
| Ado.Birth | 155 | 0 | 0.0 | 47.2 | 41.1 | 0.6 | 12.6 | 33.6 | 72.0 | 204.8 | |
| Parli.F | 155 | 0 | 0.0 | 20.9 | 11.5 | 0.0 | 12.4 | 19.3 | 27.9 | 57.5 |
# Visualize
GGally::ggpairs(
data,
upper = list(continuous = GGally::wrap("cor", method = "spearman"))
)
## Warning in cor.test.default(x, y, method = method, use = use): Cannot compute
## exact p-value with ties
5.2 PCA using raw values
Let’s deliberately make a principal component analysis with non-standardized variables.
# Principal component analysis
PCA <- prcomp(data, center = F, scale. = F)
PCA_s <- summary(PCA)
importances <- scales::percent(PCA_s$importance[2, ], .01)
importances
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## "99.99%" "0.01%" "0.00%" "0.00%" "0.00%" "0.00%" "0.00%" "0.00%"
# Plot
factoextra::fviz_pca_biplot(
PCA, repel = TRUE, col.var = "white", col.ind = "gray40", labelsize = 3,
) + theme_darkmode() +
ggtitle("GNI in PC1 explains all variance")
Problem: because the variables are on different scales, those with large absolute values dominate the analysis. Here GNI, loaded almost entirely on PC1, explains practically all of the variance. This is not acceptable, so we must standardize the data first.
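Note that standardizing beforehand is equivalent to letting `prcomp()` standardize internally via its `center` and `scale.` arguments. A sketch, assuming `data` is the human dataset loaded above (`abs()` is used only to guard against arbitrary sign flips of the loadings):

```r
# Standardize-then-PCA vs. PCA with internal standardization
p1 <- prcomp(data, center = TRUE, scale. = TRUE)
p2 <- prcomp(scale(data), center = FALSE, scale. = FALSE)
all.equal(abs(p1$rotation), abs(p2$rotation))  # should be TRUE
```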
5.3 PCA using standardized values
A better way to do principal component analysis is to have the variables on a similar measurement scale. (There is still the limitation that several variables have skewed distributions.)
# Standardize all variables
data_z <- scale(data) %>% data.frame()
# Principal component analysis
PCA <- prcomp(data_z, center = F, scale. = F)
PCA_s <- summary(PCA)
importances <- scales::percent(PCA_s$importance[2, ], .01)
# Plot
factoextra::fviz_pca_biplot(
PCA, repel = TRUE, col.var = "white", col.ind = "gray40", labelsize = 3,
) + theme_darkmode() +
ggtitle("Variables are now balanced")
As expected, the results are much more balanced now that the variables
are measured on the same scale. The most influential variables include
Labo.FM, Mat.Mor and
Ado.Birth.
5.4 Personal interpretation
The 2nd PC shown above explains 16% of the variability in the data,
a substantial share. As usual, most of the variance is explained by PC1 (54% here).
Education and expected life span have negative loadings on the 1st PC,
while maternal mortality and adolescent birth rate have positive loadings.
These oppositely loaded variables are inversely
correlated in the dataset, e.g., Mat.Mor and
Edu.Exp: \(r \approx -0.7\).
5.5 MCA
Multiple Correspondence Analysis (MCA) is a generalization of principal component analysis for cases where the variables of interest are categorical instead of quantitative. The factoextra package offers good visualization tools: http://rpkgs.datanovia.com/factoextra/reference/fviz_mca.html.
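Under the hood, MCA applies a correspondence-analysis decomposition to the indicator (one-hot) expansion of the factors. A base-R sketch of that expansion, on a toy factor rather than the tea data:

```r
# One 0/1 indicator column per factor level
indicator <- function(f) sapply(levels(f), function(l) as.integer(f == l))
sugar <- factor(c("sugar", "No.sugar", "sugar"))
indicator(sugar)   # 3 x 2 matrix; exactly one 1 per row
```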
We will apply MCA to the tea dataset. First, let’s see what kind of data we are dealing with:
# Get data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
glimpse(tea)#; View(tea)
## Rows: 300
## Columns: 36
## $ breakfast <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, b…
## $ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N…
## $ evening <fct> Not.evening, Not.evening, evening, Not.evening, eveni…
## $ lunch <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch…
## $ dinner <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, d…
## $ always <fct> Not.always, Not.always, Not.always, Not.always, alway…
## $ home <fct> home, home, home, home, home, home, home, home, home,…
## $ work <fct> Not.work, Not.work, work, Not.work, Not.work, Not.wor…
## $ tearoom <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, N…
## $ friends <fct> Not.friends, Not.friends, friends, Not.friends, Not.f…
## $ resto <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, No…
## $ pub <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,…
## $ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G…
## $ How <fct> alone, milk, alone, alone, alone, alone, alone, milk,…
## $ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,…
## $ how <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,…
## $ where <fct> chain store, chain store, chain store, chain store, c…
## $ price <fct> p_unknown, p_variable, p_variable, p_variable, p_vari…
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 6…
## $ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F,…
## $ SPC <fct> middle, middle, other worker, student, employee, stud…
## $ Sport <fct> sportsman, sportsman, sportsman, Not.sportsman, sport…
## $ age_Q <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-4…
## $ frequency <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/we…
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-ex…
## $ spirituality <fct> Not.spirituality, Not.spirituality, Not.spirituality,…
## $ healthy <fct> healthy, healthy, healthy, healthy, Not.healthy, heal…
## $ diuretic <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diure…
## $ friendliness <fct> Not.friendliness, Not.friendliness, friendliness, Not…
## $ iron.absorption <fct> Not.iron absorption, Not.iron absorption, Not.iron ab…
## $ feminine <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminin…
## $ sophisticated <fct> Not.sophisticated, Not.sophisticated, Not.sophisticat…
## $ slimming <fct> No.slimming, No.slimming, No.slimming, No.slimming, N…
## $ exciting <fct> No.exciting, exciting, No.exciting, No.exciting, No.e…
## $ relaxing <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxin…
## $ effect.on.health <fct> No.effect on health, No.effect on health, No.effect o…
# Visualize
tea %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap("key", scales = "free", ncol = 8) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
ggtitle('Visualization of the "tea" data set')
## Warning: attributes are not identical across measure variables; they will be
## dropped
It looks like the age variable is not truly categorical. Let’s remove it and conduct the MCA:
# Drop problematic variable as it is not truly categorical
names(tea)[19] # Age
## [1] "age"
mca <- FactoMineR::MCA(tea[, -which(names(tea) == "age")], graph = FALSE)
summary(mca)
##
## Call:
## FactoMineR::MCA(X = tea[, -which(names(tea) == "age")], graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.090 0.082 0.070 0.063 0.056 0.053 0.050
## % of var. 5.838 5.292 4.551 4.057 3.616 3.465 3.272
## Cumulative % of var. 5.838 11.130 15.681 19.738 23.354 26.819 30.091
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.048 0.047 0.044 0.041 0.040 0.039 0.037
## % of var. 3.090 3.053 2.834 2.643 2.623 2.531 2.388
## Cumulative % of var. 33.181 36.234 39.068 41.711 44.334 46.865 49.252
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.035 0.034 0.032 0.031 0.031 0.030
## % of var. 2.302 2.275 2.172 2.085 2.013 2.011 1.915
## Cumulative % of var. 51.554 53.829 56.000 58.086 60.099 62.110 64.025
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.028 0.027 0.026 0.025 0.025 0.024 0.024
## % of var. 1.847 1.740 1.686 1.638 1.609 1.571 1.524
## Cumulative % of var. 65.872 67.611 69.297 70.935 72.544 74.115 75.639
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.021 0.020 0.020 0.019 0.019
## % of var. 1.459 1.425 1.378 1.322 1.281 1.241 1.222
## Cumulative % of var. 77.099 78.523 79.901 81.223 82.504 83.745 84.967
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.018 0.017 0.017 0.016 0.015 0.015 0.014
## % of var. 1.152 1.092 1.072 1.019 0.993 0.950 0.924
## Cumulative % of var. 86.119 87.211 88.283 89.301 90.294 91.244 92.169
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.014 0.013 0.012 0.011 0.011 0.010 0.010
## % of var. 0.891 0.833 0.792 0.729 0.716 0.666 0.660
## Cumulative % of var. 93.060 93.893 94.684 95.414 96.130 96.796 97.456
## Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.009 0.009 0.008 0.007 0.006
## % of var. 0.605 0.584 0.519 0.447 0.390
## Cumulative % of var. 98.060 98.644 99.163 99.610 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.580 1.246 0.174 | 0.155 0.098 0.012 | 0.052 0.013
## 2 | -0.376 0.522 0.108 | 0.293 0.350 0.066 | -0.164 0.127
## 3 | 0.083 0.026 0.004 | -0.155 0.099 0.015 | 0.122 0.071
## 4 | -0.569 1.196 0.236 | -0.273 0.304 0.054 | -0.019 0.002
## 5 | -0.145 0.078 0.020 | -0.142 0.083 0.019 | 0.002 0.000
## 6 | -0.676 1.693 0.272 | -0.284 0.330 0.048 | -0.021 0.002
## 7 | -0.191 0.135 0.027 | 0.020 0.002 0.000 | 0.141 0.095
## 8 | -0.043 0.007 0.001 | 0.108 0.047 0.009 | -0.089 0.038
## 9 | -0.027 0.003 0.000 | 0.267 0.291 0.049 | 0.341 0.553
## 10 | 0.205 0.155 0.028 | 0.366 0.546 0.089 | 0.281 0.374
## cos2
## 1 0.001 |
## 2 0.021 |
## 3 0.009 |
## 4 0.000 |
## 5 0.000 |
## 6 0.000 |
## 7 0.015 |
## 8 0.006 |
## 9 0.080 |
## 10 0.052 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.182 0.504 0.031 3.022 | 0.020 0.007 0.000 0.330 |
## Not.breakfast | -0.168 0.465 0.031 -3.022 | -0.018 0.006 0.000 -0.330 |
## Not.tea time | -0.556 4.286 0.240 -8.468 | 0.004 0.000 0.000 0.065 |
## tea time | 0.431 3.322 0.240 8.468 | -0.003 0.000 0.000 -0.065 |
## evening | 0.276 0.830 0.040 3.452 | -0.409 2.006 0.087 -5.109 |
## Not.evening | -0.144 0.434 0.040 -3.452 | 0.214 1.049 0.087 5.109 |
## lunch | 0.601 1.678 0.062 4.306 | -0.408 0.854 0.029 -2.924 |
## Not.lunch | -0.103 0.288 0.062 -4.306 | 0.070 0.147 0.029 2.924 |
## dinner | -1.105 2.709 0.092 -5.240 | -0.081 0.016 0.000 -0.386 |
## Not.dinner | 0.083 0.204 0.092 5.240 | 0.006 0.001 0.000 0.386 |
## Dim.3 ctr cos2 v.test
## breakfast -0.107 0.225 0.011 -1.784 |
## Not.breakfast 0.099 0.208 0.011 1.784 |
## Not.tea time 0.062 0.069 0.003 0.950 |
## tea time -0.048 0.054 0.003 -0.950 |
## evening 0.344 1.653 0.062 4.301 |
## Not.evening -0.180 0.864 0.062 -4.301 |
## lunch 0.240 0.343 0.010 1.719 |
## Not.lunch -0.041 0.059 0.010 -1.719 |
## dinner 0.796 1.805 0.048 3.777 |
## Not.dinner -0.060 0.136 0.048 -3.777 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.031 0.000 0.011 |
## tea.time | 0.240 0.000 0.003 |
## evening | 0.040 0.087 0.062 |
## lunch | 0.062 0.029 0.010 |
## dinner | 0.092 0.000 0.048 |
## always | 0.056 0.035 0.007 |
## home | 0.016 0.002 0.030 |
## work | 0.075 0.020 0.022 |
## tearoom | 0.321 0.019 0.031 |
## friends | 0.186 0.061 0.030 |
Plot MCA results:
# "Boring" biplot
# factoextra::fviz_mca_biplot(
# mca,
# ) + theme_darkmode()
# Contributions of variable levels to dimensions 1 and 2
figg <- \ (x) {
theme_darkmode() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
}
ggpubr::ggarrange(
factoextra::fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 20)) +
theme_darkmode(),
factoextra::fviz_contrib(mca, choice = "var", axes = 1, top = 15) + figg(),
factoextra::fviz_contrib(mca, choice = "var", axes = 2, top = 15) + figg(),
align = "hv",
ncol = 3
)
# Look at interesting variables
factoextra::fviz_ellipses(
mca, c("tearoom", "age_Q"),
geom = "point"
) + theme_darkmode()
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
This MCA summarizes all the tea-drinking variables of the 300 participants. Because there are many variables, the proportion of variability explained by each dimension is relatively low (at most 5.8%, for dimension 1). The distance between row points or column points in the MCA factor map reflects their similarity (or dissimilarity): similar column and row points lie close to each other.
The bar plots show the contributions of the variables to the principal dimensions. The variable tearoom has the strongest correlation with dimension 1, and age_Q has the strongest correlation with dimension 2.
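For context, the eta² of a categorical variable on a dimension equals the R² of a one-way ANOVA of the dimension scores on that variable. A toy sketch with simulated scores (not the actual MCA output):

```r
set.seed(5)
g <- factor(rep(c("tearoom", "Not.tearoom"), each = 50))       # toy grouping
dim1 <- ifelse(g == "tearoom", 0.5, -0.5) + rnorm(100, sd = 0.3)  # toy scores
summary(lm(dim1 ~ g))$r.squared   # equals the eta-squared of g on dim1
```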
# Run time of this chapter code
end.time <- Sys.time(); elapsed.time <- (end.time - start.time) %>% round()
elapsed.time
## Time difference of 22 secs
6 Analysis of longitudinal data
Setup
# Packages
pacman::p_load(tidyverse)
# Custom functions
source("src/fig_setup.R")
# Code run time
start.time <- Sys.time()
6.0 Get data
Automatically choose the newest version of each dataset.
# Read the most up-to-date data
datasets <- list.files("data/bprs_rats/ready", full.names = T)
newest <- datasets %>% sub('.*_', '', .) %>% as.Date() %>% max()
data <- datasets[grepl(newest, datasets) & grepl("long", datasets)] %>%
map(. %>% readr::read_csv(show_col_types = F))
BPRSL <- data[[1]] %>% mutate(
subject = factor(subject),
treatment = recode_factor(treatment, `1` = "Treatment 1", `2` = "Treatment 2")
)
RATSL <- data[[2]] %>% mutate(
ID = factor(ID),
Group = recode_factor(Group, `1` = "Diet 1", `2` = "Diet 2", `3` = "Diet 3")
)
# Uncomment to get metadata
# ?nlme::BodyWeight
We have two datasets:
- RATSL describes three groups of rats that were put on different diets. Each animal’s body weight was recorded repeatedly over a 9-week period. The important question is whether the growth profiles of the three groups differ.
- BPRSL is from a parallel-arm human RCT in which 40 male subjects were randomized to one of two interventions. Participants were rated on the brief psychiatric rating scale (BPRS) at baseline and then at weekly intervals for 8 weeks. bprs assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used for patients who may have schizophrenia.
6.1 RATS
Briefly checking the data structure.
str(RATSL)
## tibble [176 × 4] (S3: tbl_df/tbl/data.frame)
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Group : Factor w/ 3 levels "Diet 1","Diet 2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WD : num [1:176] 1 8 15 22 29 36 43 44 50 57 ...
## $ Weight: num [1:176] 240 250 255 260 262 258 266 266 265 272 ...
There are 16 rats divided into 3 diet groups. Group 1 is the control group, which probably receives a typical chow diet. The compositions of diets 2 and 3 remain a mystery to us…
# N of rats assigned to different diets
RATSL %>%
filter(WD == 1) %>%
{table(.$Group)}
##
## Diet 1 Diet 2 Diet 3
## 8 4 4
Graphical displays of data
# Plot individual rat data
RATSL %>%
ggplot(aes(x = WD, y = Weight, color = ID)) +
geom_line() +
theme(legend.position = "none") +
facet_grid(. ~ Group) +
scale_y_continuous( # Ounces for US folks :D
"Weight, g",
sec.axis = sec_axis(~ . / 28.35, breaks = seq(0, 100, 2), name = "Weight, oz.")
) + labs(title = "Rats' growth by diet", subtitle = "Hand and Crowder (1996)") + xlab("Time, days")
The control rats (Diet 1) seem to have low baseline weight and not much growth. Diet 2 appears to have higher baseline weight, more growth, and a clear outlier (rat no. 12) that is much heavier than any other rat. Rats on Diet 3 may have even higher baseline weights, but their growth seems similar to that of rats on Diet 2.
Tracking
Let’s standardize the Weight variable by group:
RATSL %>%
group_by(Group) %>% mutate(Weight_std = scale(Weight)[, 1]) %>% ungroup() %>%
ggplot(aes(x = WD, y = Weight_std, color = ID)) +
geom_hline(yintercept = 0) +
geom_line() +
theme(legend.position = "none") +
facet_grid(. ~ Group) +
ggtitle("Tracking is clearly present in the RATS data") + xlab("Time, days") + ylab("Standardized weight, SD")
There is considerable tracking: rats with higher baseline weight tend to have higher weight throughout the study.
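Tracking can be quantified as the correlation between baseline and end-of-study values. A toy sketch with simulated weights (not the RATS data):

```r
set.seed(7)
baseline <- rnorm(16, mean = 350, sd = 60)       # 16 simulated rats
final    <- 1.1 * baseline + rnorm(16, sd = 10)  # growth preserves the ranking
cor(baseline, final)   # close to 1: strong tracking
```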
Tendencies
Not-so-obvious patterns can be found by visualizing descriptive summaries of data.
dodge <- position_dodge(.5)
mypalette <- c("chartreuse", "purple1", "turquoise")
# Mean (95% confidence intervals)
p <- RATSL %>%
filter(WD != 44) %>%
ggplot(aes(x = WD / 7, y = Weight, color = Group))
p <- p + geom_point(
shape = 21, size = 1, alpha = .8,
position = position_jitterdodge(jitter.width = .3, dodge.width = .5)
)
p <- p + stat_summary(fun = mean, geom = "line", position = dodge)
p <- p + stat_summary(
fun.data = mean_cl_normal, geom = "errorbar", position = dodge, width = .3
)
p <- p + theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
p <- p + scale_color_manual(values = mypalette)
p <- p + scale_x_continuous(name = "Time, weeks", breaks = seq(0, 10, by = 1))
p <- p + scale_y_continuous(name = "Weight, g")
p <- p + ggtitle("Mean (95% CI)")
p1 <- p
# Boxplot visualization
p <- RATSL %>%
filter(WD != 44) %>%
ggplot(aes(x = (WD / 7) %>% round() %>% factor, y = Weight, color = Group))
p <- p + geom_line(
aes(group = ID), linewidth = .2, alpha = .5
)
p <- p + geom_boxplot(fill = NA)
p <- p + theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.position = "none"
)
p <- p + scale_color_manual(values = mypalette)
p <- p + xlab("Time, weeks") + ylab("Weight, g")
p <- p + ggtitle("Boxplot")
p2 <- p
patchwork::wrap_plots(p1, p2) +
patchwork::plot_layout(guides = "collect") +
patchwork::plot_annotation(tag_levels = 'A')
Plot A shows that the confidence intervals for Diet 2 are wide due to large inter-rat variation. This is likely driven by the outlier identified in plot B (rat no. 12). The other diets also have outliers (lighter rats), but these are not extreme enough to widen the CIs considerably. The group sizes are very small, so removing further outliers could noticeably affect the results.
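How much a single outlier can widen a small group’s confidence interval is easy to demonstrate with simulated values (a toy sketch, not the RATS data):

```r
ci_width <- function(x) diff(t.test(x)$conf.int)   # width of the 95% CI of the mean
set.seed(11)
w     <- rnorm(4, mean = 450, sd = 15)   # four well-behaved "rats"
w_out <- c(w[1:3], w[4] + 150)           # same group with one extreme outlier
c(ci_width(w), ci_width(w_out))          # the outlier inflates the CI
```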
Compare means
Now let’s remove the outlier from group 2 and build a data frame that we can use to compare the changes between groups.
# Calculate average weights on the intervention
RATSLint <- RATSL %>%
filter(WD > 1) %>%
group_by(Group, ID) %>%
reframe(mean_int = mean(Weight)) %>%
mutate( # Add baseline weight
baseline = RATSL %>% filter(WD == 1) %>% pull(Weight),
delta = mean_int - baseline
)
Plot the data:
# Plot baseline and intervention means
p1 <- RATSLint %>%
filter(ID != 12) %>%
pivot_longer(
cols = c(baseline, mean_int),
names_to = "Timepoint",
values_to = "Weight"
) %>%
mutate(Timepoint = recode(
Timepoint,
"baseline" = "At baseline",
"mean_int" = "During intervention"
)) %>%
ggplot(aes(x = Timepoint, y = Weight, color = Group)) +
geom_point(
shape = 21,
position = position_jitterdodge(jitter.width = .2, dodge.width = .5)
) +
stat_summary(fun = mean, geom = "line", position = dodge, aes(group = Group)) +
stat_summary(
fun.data = mean_cl_normal, geom = "errorbar", position = dodge, width = .3
) +
scale_color_manual(values = mypalette) +
ggtitle("Actual values by timepoint")
# Delta variables
p2 <- RATSLint %>%
filter(ID != 12) %>%
ggplot(aes(x = Group, y = delta, color = Group, fill = Group)) +
stat_summary(
fun = mean, geom = "col", aes(group = Group), alpha = .5, width = .3
) +
geom_hline(yintercept = 0) +
stat_summary(
fun.data = mean_cl_normal, geom = "errorbar", width = .2, color = "white"
) +
geom_point(
position = position_jitterdodge(jitter.width = .2, dodge.width = .5)
) +
theme(legend.position = "none") +
scale_color_manual(values = mypalette) +
scale_fill_manual(values = mypalette) +
ylab("Change in weight") + ggtitle("Weight change")
# Combine plots
patchwork::wrap_plots(p1, p2) +
patchwork::plot_layout(widths = c(1, .8), guides = "collect") +
patchwork::plot_annotation(tag_levels = 'A')
Panel A shows that all diet groups exhibit some weight increase from baseline. The increase appears strongest on Diet 2. However, panel B suggests there may be too few rats on Diet 2 to draw strong conclusions about their weight change, as the CI is very wide and includes 0. We also cannot rely on delta variables alone, because there are massive differences in baseline weight between the diet groups.
Significance
Let’s test the significance of these changes with linear regression analysis (we’ll fit mixed models with the BPRS data).
# Fit the model
# fit_full <- lm(mean_int ~ baseline + Group, data = RATSLint)
fit_excl <- lm(mean_int ~ baseline + Group, data = RATSLint %>% filter(ID != 12))
# fit_full %>% gtsummary::tbl_regression()
fit_excl %>% gtsummary::tbl_regression()
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| baseline | 0.58 | 0.30, 0.85 | <0.001 |
| Group | |||
| Diet 1 | — | — | |
| Diet 2 | 89 | 41, 138 | 0.002 |
| Diet 3 | 113 | 41, 185 | 0.005 |
| 1 CI = Confidence Interval | |||
# See anova tables
# anova(fit_full)
anova(fit_excl)
## Analysis of Variance Table
##
## Response: mean_int
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 207817 207817 2714.9357 1.601e-14 ***
## Group 2 1483 742 9.6878 0.003748 **
## Residuals 11 842 77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Confirm using repeated measures ANOVA
RATSL %>%
filter(ID != 12) %>%
rstatix::anova_test(
dv = Weight, wid = ID, within = WD, between = Group
) %>%
rstatix::get_anova_table()
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2.00 12.00 483.599 3.39e-12 * 0.986
## 2 WD 1.85 22.24 54.668 4.24e-09 * 0.404
## 3 Group:WD 3.71 22.24 4.495 9.00e-03 * 0.100
These results demonstrate that careful preprocessing is a critical part of data analysis. When all rats are included, linear regression shows no significant difference in weight change between the groups. However, after removing the major outlier (rat no. 12), both diet 2 (\(p = 0.002\)) and diet 3 (\(p = 0.005\)) differ from the control diet. This is further confirmed by repeated measures ANOVA, which shows a significant group × time interaction (i.e., the effect of time depends on the diet group). To conclude, baseline weights differ significantly between the diets, and so do weight changes during the intervention.
6.2 BPRS
Briefly checking the data structure.
str(BPRSL)
## tibble [360 × 4] (S3: tbl_df/tbl/data.frame)
## $ treatment: Factor w/ 2 levels "Treatment 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ week : num [1:360] 0 1 2 3 4 5 6 7 8 0 ...
## $ bprs : num [1:360] 42 36 36 43 41 40 38 47 51 58 ...
There are 40 male participants randomized equally to 2 interventions (\(n = 20\) each).
# N's
BPRSL %>%
filter(week == 0) %>%
{table(.$treatment)}
##
## Treatment 1 Treatment 2
## 20 20
Graphical displays
First, it is helpful to plot all the data and look for any obvious patterns.
# Pick nice colors
bprs_col <- \ (x) {scale_color_manual(values = c("turquoise2", "orange"))}
bprs_fil <- \ (x) {scale_fill_manual(values = c("turquoise2", "orange"))}
# Plot individual human data
p1 <- BPRSL %>%
ggplot(aes(x = week, y = bprs, group = subject, color = treatment)) +
geom_line(alpha = .3, linewidth = .2) +
stat_summary(fun = "mean", geom = "line", aes(group = treatment)) +
theme(legend.position = "none") +
facet_grid(. ~ treatment) +
labs(title = "BPRS scores across time", subtitle = "by treatment group") +
xlab("Time, weeks") +
bprs_col() + bprs_fil()
# Summarise a bit
p2 <- BPRSL %>%
ggplot(aes(
x = week, y = bprs,
group = subject, color = treatment, fill = treatment
)) +
stat_summary(fun = mean, geom = "line", aes(group = treatment)) +
stat_summary(
fun.data = mean_cl_normal,
geom = "ribbon", aes(group = treatment), color = NA, alpha = .2
) +
labs(title = "Considerable overlap", subtitle = "in 95% CIs of group means") +
xlab("Time, weeks") +
bprs_col() + bprs_fil()
# Combine plots
patchwork::wrap_plots(p1, p2) + patchwork::plot_layout(widths = c(2, 1))
Clearly there is a lot of variation both between and within subjects. Both treatments show a decreasing mean bprs score across time. Whether these trends are statistically significant, or whether they differ between treatments, is yet unknown. However, I suspect that both treatments lower the bprs score about equally well. These questions are best answered using linear mixed models.
Linear mixed models
The first model will have a random intercept for each subject.
# Fit random intercept model
refmod <- lmerTest::lmer(
bprs ~ week + treatment + (1 | subject),
REML = FALSE,
data = BPRSL
)
anova(refmod)#; summary(refmod)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## week 12371.5 12371.5 1 340 118.7136 <2e-16 ***
## treatment 29.5 29.5 1 340 0.2828 0.5952
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, random slopes will be fitted in addition to the random intercepts. This may give the model a better fit.
# Fit random slope & intercept model
refmod2 <- lmerTest::lmer(
bprs ~ week + treatment + (week | subject),
data = BPRSL, REML = FALSE
)
anova(refmod2)#; summary(refmod2)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## week 5666.0 5666.0 1 20 58.1540 2.424e-07 ***
## treatment 29.5 29.5 1 320 0.3025 0.5827
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the models
anova(refmod, refmod2)
## Data: BPRSL
## Models:
## refmod: bprs ~ week + treatment + (1 | subject)
## refmod2: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## refmod 5 2748.7 2768.1 -1369.4 2738.7
## refmod2 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed, adding random slopes improved the model: the likelihood-ratio test comparing the two models gives Pr(>Chisq) < 0.05, along with a lower AIC.
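The anova() comparison above is a likelihood-ratio test. Its mechanics can be sketched in base R with toy nested models (plain lm instead of lmer, but the same principle, on simulated data):

```r
set.seed(3)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
m0 <- lm(y ~ 1, data = d)   # reduced model
m1 <- lm(y ~ x, data = d)   # full model, one extra parameter
chisq <- as.numeric(2 * (logLik(m1) - logLik(m0)))        # LR statistic
df <- attr(logLik(m1), "df") - attr(logLik(m0), "df")     # parameter difference
pchisq(chisq, df = df, lower.tail = FALSE)                # the Pr(>Chisq) analogue
```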
Will an interaction term between week and treatment further improve the model? This is worth trying, especially as the group × time interaction is probably the main question of the whole study design.
# Fit random slope & intercept model with interaction
refmod3 <- lmerTest::lmer(
bprs ~ week * treatment + (week | subject),
data = BPRSL, REML = FALSE
)
anova(refmod3)#; summary(refmod2)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## week 5610.7 5610.7 1 20 58.1598 2.421e-07 ***
## treatment 138.9 138.9 1 320 1.4403 0.23097
## week:treatment 307.5 307.5 1 320 3.1870 0.07517 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the models
anova(refmod2, refmod3)
## Data: BPRSL
## Models:
## refmod2: bprs ~ week + treatment + (week | subject)
## refmod3: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## refmod2 7 2745.4 2772.6 -1365.7 2731.4
## refmod3 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Including the interaction no longer clearly improved the model. However, it may still be the most appropriate way to look at the data given the study design and research question: does the change in bprs score depend on the treatment? Of note, the interaction falls just barely short of significance (\(p = 0.075\))!
Finally, I want to test a model with the interaction term and random intercepts only, as this appears to be the most common practice in my field (dietary intervention studies).
# Fit random intercept model with interaction
refmod4 <- lmerTest::lmer(
bprs ~ week * treatment + (1 | subject),
data = BPRSL, REML = FALSE
)
anova(refmod4)#; summary(refmod2)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## week 12371.5 12371.5 1 340 119.753 < 2e-16 ***
## treatment 138.9 138.9 1 340 1.345 0.24697
## week:treatment 307.5 307.5 1 340 2.976 0.08541 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the models
anova(refmod3, refmod4)
## Data: BPRSL
## Models:
## refmod4: bprs ~ week * treatment + (1 | subject)
## refmod3: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## refmod4 6 2747.8 2771.1 -1367.9 2735.8
## refmod3 8 2744.3 2775.4 -1364.1 2728.3 7.4802 2 0.02375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now the model actually got worse, and model 3, with the interaction term and both random intercepts and slopes, remains superior. The between-treatment difference in bprs change (the week:treatment interaction term) still does not reach significance, although it comes close!
We can compare the observed values to the fitted ones.
Fitted <- fitted(refmod3)
BPRSL <- BPRSL %>% mutate(fitted = Fitted)
# Plot individual human data
p1 <- BPRSL %>%
pivot_longer(
cols = c(bprs, fitted),
names_to = "type",
values_to = "bprs"
) %>%
mutate(type = recode_factor(type, "bprs" = "Observed", "fitted" = "Fitted")) %>%
ggplot(aes(x = week, y = bprs, group = subject, color = treatment)) +
geom_line(alpha = .5, linewidth = .2) +
stat_summary(fun = mean, geom = "line", aes(group = treatment)) +
stat_summary(
fun.data = mean_cl_normal,
geom = "errorbar", aes(group = treatment),
width = .2
) +
theme(legend.position = "none") +
facet_grid(treatment ~ type) +
labs(title = "Observed and fitted BPRS scores") + xlab("Time, weeks") + bprs_col()
# Head-to-head comparison
p2 <- BPRSL %>%
pivot_longer(
cols = c(bprs, fitted),
names_to = "type",
values_to = "bprs"
) %>%
mutate(type = recode_factor(type, "bprs" = "Observed", "fitted" = "Fitted")) %>%
ggplot(aes(x = week, y = bprs, group = subject, color = treatment)) +
stat_summary(
fun = mean,
geom = "line",
aes(group = treatment),
position = dodge
) +
stat_summary(
fun = mean, geom = "point", aes(group = treatment), position = dodge
) +
stat_summary(
fun.data = mean_cl_normal, alpha = .3,
geom = "linerange", aes(group = treatment), position = dodge
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
) +
facet_grid(. ~ type) +
labs(title = "More explicit comparison between treatments") + xlab("Time, weeks") + bprs_col()
patchwork::wrap_plots(p1, p2) + patchwork::plot_layout(widths = c(1, 1))
The fitted values of this model (3) seem to describe the original data quite well. The model allows both the intercept and the slope to vary, so an optimally “personalized” (albeit linear) trajectory is modeled for each participant. In the final 2 weeks the benefits seem to plateau, which a linear time trend cannot capture. Although this study did not demonstrate a clear (significant) difference between treatments (\(p = 0.075\)), treatment 1 might hold promise of being slightly superior.
# Run time of this chapter code
end.time <- Sys.time(); elapsed.time <- (end.time - start.time) %>% round()
elapsed.time
## Time difference of 9 secs
Thanks for the great course!